# Prepare Leaf phenology type data from TRY for use

The leaf phenology type data from TRY indicate whether plant leaves are
evergreen, deciduous, semi-deciduous, semi-evergreen, or aphyllous (i.e. without leaves).

*If you intend to clean more than one or two traits, we recommend the use
of the batch pre-processing script. Refer to the [TRY main page](try-label) for details.*

If you have questions or suggestions, spot errors, or want to contribute, get in touch with us at planthub@idiv.de.

Author: David Schellenberger Costa

## Requirements

To run the script, the following is needed:
- TRY data, available <a href="https://planthub.idiv.de/downloads/" target="_parent">here</a>
- the data.table library may need to be installed

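A minimal sketch for checking the requirements before running the script. It only reports missing packages and leaves the installation commented out. R.utils is an addition not named in the list above: according to the data.table documentation, `fread()` relies on it to read `.gz` files. The variable names are illustrative.

```r
# packages used by the script; R.utils is an assumption based on the
# data.table documentation, which states that fread() needs it for .gz files
neededPackages <- c("data.table", "R.utils")

# check each package without attaching it
isInstalled <- vapply(neededPackages, requireNamespace, logical(1), quietly = TRUE)
missingPackages <- neededPackages[!isInstalled]

if (length(missingPackages) > 0) {
	message("missing packages: ", paste(missingPackages, collapse = ", "))
	# install.packages(missingPackages) # uncomment to install from CRAN
}
```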
## Code

In [None]:
# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())


Let's get the TRY data.

In [None]:
# set working directory (adapt this!); .brd is a user-defined base path that is not set in this script
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Leaf phenology type"]


To get an overview of the data, we convert values to lowercase, sort them, and show them as
a table.

In [None]:
# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by counting each value and showing the counts sorted by frequency (rarest first)
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]

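The overview step can be tried on a small toy vector. This is only a sketch with made-up values; it shows that `table()` returns the distinct values in alphabetical order, whereas indexing with `order()` on the counts sorts them by frequency.

```r
# toy values standing in for TRYSubset$OrigValueStr (made up for illustration)
toyVals <- tolower(c("Evergreen", "deciduous", "Deciduous", "1", "DECIDUOUS"))

# table() counts each distinct value; the names come out in alphabetical order
toyOverview <- table(toyVals)
names(toyOverview)

# ordering by the counts puts the rarest values first and the most frequent last
toyByFrequency <- toyOverview[order(toyOverview)]
names(toyByFrequency)
```

With long value lists, the frequency ordering keeps the most common entries at the end of the console output, where they are easiest to spot.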

Apparently, several trait categories are mixed here. We will prepare a matrix with three columns
to separately save entries describing leaf phenology, leaf shape, and plant succulence.

In [None]:
newVals <- matrix(NA, length(oriVals), 3)


There are also a couple of coded entries that need to be decoded. The categories behind the integer
codes are given in the "OriglName" field of the data. Let's first check which "OriglName" values
occur together with the entries "1", "y", or "yes".

In [None]:
datNames <- names(table(TRYSubset[oriVals == "1"]$OriglName))
for (i in seq_along(datNames)) {
	print(datNames[i])
	print(TRYSubset[OriglName == datNames[i]][1:2])
	print("--------------------------------------")
}
datNames <- names(table(TRYSubset[oriVals %in% c("y", "yes")]$OriglName))
for (i in seq_along(datNames)) {
	print(datNames[i])
	print(TRYSubset[OriglName == datNames[i]][1:2])
	print("--------------------------------------")
}


We conclude that datasets code phenology differently and can be told apart by their "OriglName".
Taking this into account, we decode the values into strings. Afterwards, we remove all entries
that contain no letters, such as the remaining purely numeric values.

In [None]:
oriVals[TRYSubset$OriglName == "Decid 1, Everg 2, Mixed 3" & oriVals == "1"] <- "deciduous"
oriVals[TRYSubset$OriglName == "Decid 1, Everg 2, Mixed 3" & oriVals == "2"] <- "evergreen"
oriVals[TRYSubset$OriglName == "Decid 1, Everg 2, Mixed 3" & oriVals == "3"] <- "semi-deciduous"
oriVals[TRYSubset$OriglName %in% c("Leaf phenology: deciduous", "Deciduous") & oriVals %in% c("1", "y", "yes")] <-
	"deciduous"
oriVals[TRYSubset$OriglName %in% c("Leaf phenology: evergreen", "Evergreen") & oriVals %in% c("1", "y", "yes")] <-
	"evergreen"

# remove entries without any letters (e.g. numeric codes that could not be decoded)
oriVals[!grepl("[[:lower:]]", oriVals)] <- NA

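The decoding logic can be illustrated on toy data. In this sketch, `toyVals` and `toyNames` stand in for `oriVals` and `TRYSubset$OriglName`; the values, including the field name "Leaf phenology", are made up for illustration.

```r
# toy data standing in for oriVals and TRYSubset$OriglName (made up for illustration)
toyVals <- c("1", "2", "yes", "1", "evergreen", "42")
toyNames <- c(
	"Decid 1, Everg 2, Mixed 3", "Decid 1, Everg 2, Mixed 3",
	"Deciduous", "Evergreen", "Leaf phenology", "Leaf phenology"
)

# the same code means different things depending on the original field name
toyVals[toyNames == "Decid 1, Everg 2, Mixed 3" & toyVals == "1"] <- "deciduous"
toyVals[toyNames == "Decid 1, Everg 2, Mixed 3" & toyVals == "2"] <- "evergreen"
toyVals[toyNames == "Deciduous" & toyVals %in% c("1", "y", "yes")] <- "deciduous"
toyVals[toyNames == "Evergreen" & toyVals %in% c("1", "y", "yes")] <- "evergreen"

# entries without any lowercase letter (here "42") cannot be interpreted and become NA
toyVals[!grepl("[[:lower:]]", toyVals)] <- NA
toyVals
```

Note that the first and fourth entries both start as "1" but end up as "deciduous" and "evergreen", respectively, because their field names differ.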

The most important part of the cleaning process is the definition of the search strings to look for.
We use regular expressions in some cases to be more inclusive (or exclusive).

In [None]:
searchNames <- c(
	"(^| )evergr?ee?n|^e$|^ev$|persistent green|n\\.d\\.|overwintering green|hibernal|perennifoli|wintergreen",
	"(^| )dec(i|u)o?(c|d)uous|^d$|summer ?green|nonevergreen|aestival|caducifolio|autumngreen|springgree",
	"semi-?deciduous|semi-?evergreen",
	"aphyllous|leafless",
	"^succulent",
	"non-?\\s?succulent"
)

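To see what the regular expressions include and exclude, it can help to run them against a few test strings. The sketch below copies the first three patterns from the cell above; the test strings and the names `toyPatterns`, `toyStrings`, and `toyMatches` are made up for illustration.

```r
# first three search patterns from the script, named by their target category
toyPatterns <- c(
	evergreen = "(^| )evergr?ee?n|^e$|^ev$|persistent green|n\\.d\\.|overwintering green|hibernal|perennifoli|wintergreen",
	deciduous = "(^| )dec(i|u)o?(c|d)uous|^d$|summer ?green|nonevergreen|aestival|caducifolio|autumngreen|springgree",
	variable = "semi-?deciduous|semi-?evergreen"
)

# made-up test strings, including a misspelling and compound terms
toyStrings <- c("evergreen", "evergeen", "semi-evergreen", "nonevergreen", "summer green", "d")

# one column per pattern, one row per test string
toyMatches <- sapply(toyPatterns, grepl, toyStrings)
rownames(toyMatches) <- toyStrings
toyMatches
```

The `(^| )` prefix is what keeps "semi-evergreen" and "nonevergreen" out of the evergreen category: the word must start the string or follow a space.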

We can now search for the strings defined before and give names to the new categories.

In [None]:
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)

# name the columns of the searchResults matrix after the cleaned categories
colnames(searchResults) <- c("evergreen", "deciduous", "variable", "aphyllous", "succulent", "non-succulent")


Let's have a look at the results.

In [None]:
# show the number of matches to each category
colSums(searchResults)

# show the original entries for which no match was retrieved
oriVals[rowSums(searchResults) < 1]

# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)

# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)

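The checks above rely on row and column sums of a logical matrix. A minimal sketch with a made-up match matrix (`toyResults` is illustrative, not part of the script):

```r
# toy match matrix: rows are entries, columns are categories (made up for illustration)
toyResults <- matrix(
	c(
		TRUE, FALSE,  # matched one category
		FALSE, FALSE, # matched nothing
		TRUE, TRUE    # matched two categories
	),
	ncol = 2, byrow = TRUE,
	dimnames = list(c("evergreen", "unknown", "evergreen or deciduous"), c("evergreen", "deciduous"))
)

# number of matches per category
toyCounts <- colSums(toyResults)

# entries without any match and entries with more than one match
unmatched <- rownames(toyResults)[rowSums(toyResults) < 1]
multiple <- rownames(toyResults)[rowSums(toyResults) > 1]
```

Inspecting the unmatched entries is the step that usually suggests new alternatives for the search strings.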

As the categories within each group should be mutually exclusive, we exclude all ambiguous data
by setting the search results of a group to FALSE whenever more than
one of its categories was matched.

In [None]:
searchResults[rowSums(searchResults[, c(1:4)]) > 1, c(1:4)] <- FALSE # evergreen, deciduous, variable, aphyllous
searchResults[rowSums(searchResults[, c(5:6)]) > 1, c(5:6)] <- FALSE # succulent, non-succulent

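The group-wise exclusion can be sketched on a small made-up matrix with two groups of two categories each. The point to note is that matches in different groups do not conflict; only multiple matches inside one group are reset.

```r
# toy match matrix with a phenology group (columns 1:2) and a succulence group (columns 3:4)
toyResults <- matrix(
	c(
		TRUE, FALSE, TRUE, FALSE, # one match per group: kept
		TRUE, TRUE, FALSE, FALSE, # two phenology matches: ambiguous
		FALSE, TRUE, TRUE, TRUE   # two succulence matches: ambiguous
	),
	ncol = 4, byrow = TRUE,
	dimnames = list(NULL, c("evergreen", "deciduous", "succulent", "non-succulent"))
)

# reset a group only for rows with more than one match inside that group
toyResults[rowSums(toyResults[, 1:2]) > 1, 1:2] <- FALSE
toyResults[rowSums(toyResults[, 3:4]) > 1, 3:4] <- FALSE
toyResults
```

In the third row, only the succulence columns are cleared; the unambiguous "deciduous" match survives.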

Now, we can create new strings with the cleaned values and add them to the observations. To
keep the original entries, we create a new column called "CleanedValueStr". We separate
entries relating to phenology, leaf shape, and succulence, and change the trait name of the latter two.

In [None]:
# use the searchResults matrix to create new value strings by concatenating all data found
newVals[, 1] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[c(1:3)][searchResults[x, c(1:3)]], collapse = ",")
})
newVals[, 2] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[4][searchResults[x, 4]], collapse = ",")
})
newVals[, 3] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[c(5, 6)][searchResults[x, c(5, 6)]], collapse = ",")
})
newVals[newVals == ""] <- NA

# move aphyllous entries to the trait "Leaf shape" (marked with the prefix "goto")
TRY[TraitName == "Leaf phenology type", CleanedValueStr := newVals[, 2]]
TRY[TraitName == "Leaf phenology type", TraitName := "gotoLeaf shape"]

# re-append the original rows and move succulence entries to the trait "Plant succulence"
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Leaf phenology type", CleanedValueStr := newVals[, 3]]
TRY[TraitName == "Leaf phenology type", TraitName := "gotoPlant succulence"]

# re-append the original rows once more and add the cleaned phenology values
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Leaf phenology type", CleanedValueStr := newVals[, 1]]

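The way a row of the logical matrix becomes a cleaned string can be shown in isolation. A minimal sketch with made-up matches; `toyResults` and `toyClean` are illustrative names.

```r
# toy match matrix: the first entry matched "evergreen", the second matched nothing
toyResults <- matrix(
	c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
	ncol = 3, byrow = TRUE,
	dimnames = list(NULL, c("evergreen", "deciduous", "variable"))
)

# for each row, paste the names of all columns that are TRUE
toyClean <- sapply(seq_len(nrow(toyResults)), function(x) {
	paste(colnames(toyResults)[toyResults[x, ]], collapse = ",")
})

# rows without any match yield an empty string, which is turned into NA
toyClean[toyClean == ""] <- NA
toyClean
```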

We duplicated the rows to accommodate the values belonging to other traits. To avoid an unnecessary increase
in file size, we remove those duplicated rows that have no value in the "CleanedValueStr" column.

In [None]:
TRY <- TRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]

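The duplicate-and-filter approach can be followed end to end on a toy table. This sketch assumes data.table is installed and uses only one extra trait instead of two; all values and the names `toyTRY`, `toySubset`, `toyShape`, and `toyPhenology` are made up for illustration.

```r
library(data.table)

# toy table standing in for TRY (values made up for illustration)
toyTRY <- data.table(
	TraitName = c("Leaf phenology type", "Leaf phenology type", "Plant height"),
	OrigValueStr = c("evergreen", "leafless", "1.2")
)
toySubset <- toyTRY[TraitName == "Leaf phenology type"]

# cleaned values per target trait, in the row order of toySubset
toyShape <- c(NA, "aphyllous")
toyPhenology <- c("evergreen", NA)

# first pass: store the leaf shape values and rename the rows
toyTRY[TraitName == "Leaf phenology type", CleanedValueStr := toyShape]
toyTRY[TraitName == "Leaf phenology type", TraitName := "gotoLeaf shape"]

# second pass: re-append the original rows and store the phenology values
toyTRY <- rbind(toyTRY, toySubset, fill = TRUE)
toyTRY[TraitName == "Leaf phenology type", CleanedValueStr := toyPhenology]

# drop renamed duplicates that carry no cleaned value
toyTRY <- toyTRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]
toyTRY
```

Only the "goto" copy of the "evergreen" row is dropped. The phenology row of "leafless" stays even though its cleaned value is NA, because the filter removes empty rows only among the renamed duplicates.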

We have used existing trait names with the prefix "goto" to classify some data. This was done
to eventually move the data to the respective trait while avoiding another round of pre-processing.
Therefore, only run the following line if this is the last of the pre-processing scripts you want to use.

In [None]:
TRY[, TraitName := sub("^goto", "", TraitName)]

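A short sketch of what the prefix removal does. The `^` anchor restricts the replacement to the start of the string, so a name that merely contains "goto" is left alone. The last toy name is made up to show this.

```r
# toy trait names; the last one is made up to show that only a leading "goto" is removed
toyTraits <- c("gotoLeaf shape", "gotoPlant succulence", "Leaf phenology type", "Trait with goto inside")

toyRenamed <- sub("^goto", "", toyTraits)
toyRenamed
```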

Let's write the data to a file.

In [None]:
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))

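To check that the written file can be read back, a round trip on a toy table may help. This sketch assumes data.table is installed and writes to a temporary directory; the table contents and the names `toyTRY`, `toyFile`, and `toyBack` are made up. Reading back uses base R's `gzfile()`, which also reads uncompressed files, so the check does not depend on whether the installed data.table version compresses the output.

```r
library(data.table)

# toy table standing in for the processed TRY data (values made up for illustration)
toyTRY <- data.table(
	TraitName = c("Leaf phenology type", "Leaf shape"),
	CleanedValueStr = c("evergreen", "aphyllous")
)

# write to a dated file in the temporary directory, following the naming used above
toyFile <- file.path(tempdir(), paste0("TRY_toy_", Sys.Date(), ".gz"))
fwrite(toyTRY, file = toyFile)

# read back with base R through a gzfile() connection
toyBack <- read.csv(gzfile(toyFile), stringsAsFactors = FALSE)
toyBack
```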